llama : tokenizer unicode codepoint categories #8606

Open
wants to merge 30 commits into
base: master

jaime-m-p
Collaborator

@jaime-m-p jaime-m-p commented Jul 20, 2024

Add all unicode categories to unicode-data.cpp.

Currently we are limited to the top-level categories:

  • C, L, M, N, P, S, Z.

This PR allows access to subcategories:

  • Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs.

Related PR: #8579, regex using Lu, Lt, Lm, Lo, etc.
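
To illustrate the difference between a top-level category and a subcategory, here is a minimal sketch. It is not the PR's code: the table `k_sample_subcategory` and the helpers `sample_subcategory` and `sample_matches` are hypothetical names, and the table is a tiny hand-written ASCII sample rather than the generated data in unicode-data.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical hand-written sample of Unicode General_Category values.
// Real data would come from the generated tables, not from this map.
static const std::map<uint32_t, std::string> k_sample_subcategory = {
    { 0x0021, "Po" }, // !
    { 0x0024, "Sc" }, // $
    { 0x0028, "Ps" }, // (
    { 0x0029, "Pe" }, // )
    { 0x002D, "Pd" }, // -
    { 0x0031, "Nd" }, // 1
    { 0x0041, "Lu" }, // A
    { 0x005F, "Pc" }, // _
    { 0x0061, "Ll" }, // a
};

// Returns the two-letter subcategory, or an empty string when the
// codepoint is not part of the sample table.
static std::string sample_subcategory(uint32_t cpt) {
    const auto it = k_sample_subcategory.find(cpt);
    return it == k_sample_subcategory.end() ? std::string() : it->second;
}

// Matches either a top-level category ("P") or a subcategory ("Po").
static bool sample_matches(uint32_t cpt, const std::string & category) {
    assert(category.size() == 1 || category.size() == 2);
    const std::string sub = sample_subcategory(cpt);
    if (sub.empty()) {
        return false; // unknown in this sample
    }
    if (category.size() == 1) {
        return sub[0] == category[0]; // top-level: compare the first letter only
    }
    return sub == category; // subcategory: exact match
}
```

With this sample, `-` (U+002D, Pd) matches `P` but not `Po`, while `!` (U+0021, Po) matches both. That distinction is what only becomes expressible once subcategories are available.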

TODO: Add more comments to explain the unicode regex collapse trick for all subcategories.


@github-actions github-actions bot added the script, testing, and python labels Jul 20, 2024
@compilade
Collaborator

compilade commented Jul 21, 2024

Nice! This should also help fix (at least part of) Falcon's tokenization, because the Punctuation pre-tokenizer type uses the Po category and not the broader P one.

(ref: https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/tokenizers/src/pre_tokenizers/punctuation.rs#L8, which uses Rust's is_ascii_punctuation and is_punctuation)

Owner

@ggerganov ggerganov left a comment


> TODO: Implement unicode regex collapse trick for all subcategories.

Do you expect any problems with this? Probably we will run out of ASCII characters for the k_ucat_cpt map:

llama.cpp/src/unicode.cpp

Lines 647 to 651 in 50e0535

static const std::map<int, int> k_ucat_cpt = {
    { codepoint_flags::NUMBER,      0xD1 },
    { codepoint_flags::LETTER,      0xD2 },
    { codepoint_flags::PUNCTUATION, 0xD3 },
};

Though we could dynamically generate the map based only on the subcategories used in the current regex.
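
A rough sketch of that dynamic generation, assuming the collapse trick needs one placeholder value per category that actually appears in the regex. The function name `assign_collapse_bytes` and the starting value `0xD1` (borrowed from the snippet above) are illustrative assumptions, not the existing implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: scan a regex for \p{...} / \P{...} and give each
// distinct category name its own placeholder value, in order of appearance.
static std::map<std::string, uint8_t> assign_collapse_bytes(
        const std::string & regex_expr, uint8_t first_byte = 0xD1) {
    std::map<std::string, uint8_t> result;
    uint32_t next = first_byte;
    for (size_t i = 0; i < regex_expr.size(); ++i) {
        if (regex_expr[i] != '\\') {
            continue;
        }
        const bool is_category = i + 2 < regex_expr.size()
            && (regex_expr[i + 1] == 'p' || regex_expr[i + 1] == 'P')
            && regex_expr[i + 2] == '{';
        if (!is_category) {
            ++i; // skip the escaped character, so "\\" followed by "p{" is not misread
            continue;
        }
        const size_t close = regex_expr.find('}', i + 3);
        if (close == std::string::npos) {
            break; // malformed expression, stop scanning
        }
        const std::string name = regex_expr.substr(i + 3, close - (i + 3));
        if (result.find(name) == result.end()) {
            assert(next <= 0xFF && "ran out of single-byte placeholder values");
            result[name] = (uint8_t) next++;
        }
        i = close;
    }
    return result;
}
```

For a regex such as `\p{Lu}+|\p{Nd}{1,3}|\p{Lu}\p{Po}` this assigns three placeholders (Lu, Nd, Po) instead of reserving one for every subcategory up front.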

@mofosyne mofosyne added the Review Complexity : Medium label Jul 22, 2024
@ggerganov
Owner

The src/llama.cpp conflict should be easy to resolve: just accept the new src/llama.cpp and apply the same changes to src/llama-vocab.cpp instead.

@jaime-m-p
Collaborator Author

jaime-m-p commented Jul 25, 2024

> TODO: Implement unicode regex collapse trick for all subcategories.
>
> Do you expect any problems with this?

More problems than I thought:

  • Need +29 collapse codepoints for the subcategories.
  • Need ranges of collapse codepoints, e.g. \p{L} --> \p{Ll} to \p{Lu} (Ll, Lm, Lo, Lt, Lu).
  • Need a collapse codepoint for Unicode whitespace to fix the \s problem (std::regex ignores non-ASCII whitespace for \s).
    • Must also take care of \S and regex lookaheads, e.g. (?!\S).
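
The range idea in the second bullet can be sketched as follows: if the placeholder values for the subcategories of one top-level category are contiguous, a top-level \p{L} becomes a single character range. This is only an illustration; the ordering in `k_subcats`, the base value `0xC0`, and the helper names are assumptions, not what the PR implements.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Subcategories ordered so that each top-level category is contiguous.
static const std::vector<std::string> k_subcats = {
    "Cc", "Cf", "Cn", "Co", "Cs",
    "Ll", "Lm", "Lo", "Lt", "Lu",
    "Mc", "Me", "Mn",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps",
    "Sc", "Sk", "Sm", "So",
    "Zl", "Zp", "Zs",
};

struct collapse_range { int first; int last; };

// Placeholder range for a category name: one value for "Lu", a span for "L".
// Returns { -1, -1 } for an unknown name. The base value is hypothetical.
static collapse_range collapse_range_of(const std::string & name, int base = 0xC0) {
    collapse_range r = { -1, -1 };
    for (size_t i = 0; i < k_subcats.size(); ++i) {
        const bool match = name.size() == 1
            ? k_subcats[i][0] == name[0]
            : k_subcats[i] == name;
        if (match) {
            if (r.first < 0) {
                r.first = base + (int) i;
            }
            r.last = base + (int) i;
        }
    }
    return r;
}

// Regex fragment that would replace \p{name} after collapsing.
static std::string collapse_class(const std::string & name, int base = 0xC0) {
    const collapse_range r = collapse_range_of(name, base);
    assert(r.first >= 0 && "unknown category name");
    char buf[32];
    if (r.first == r.last) {
        std::snprintf(buf, sizeof(buf), "\\x%02X", r.first);
    } else {
        std::snprintf(buf, sizeof(buf), "[\\x%02X-\\x%02X]", r.first, r.last);
    }
    return buf;
}
```

Under these assumptions `collapse_class("L")` yields `[\xC5-\xC9]`, covering Ll through Lu, while `collapse_class("Lu")` yields the single value `\xC9`.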

@jaime-m-p
Collaborator Author

jaime-m-p commented Jul 25, 2024

I tested all available BPE models, including tekken, with a subset of the brute-force tests. The results are the same as before this PR.
I also tested the original tekken regex, and it seems correct too.

The reimplementation is hard to follow without context.
I want to add more comments and try to explain every step and block of code.

Owner

@ggerganov ggerganov left a comment


Most of the assert calls in unicode.cpp should be changed to GGML_ABORT or GGML_ASSERT.

jaime-m-p added 12 commits August 5, 2024 23:55
- Reorganize category/subcategory bits.
- Regex flags for \s \w \d.
- Using std::basic_regex.
- Custom std::ctype specialization for 32-bit codepoints.
- Custom std::regex_traits specialization for 32-bit codepoints.
- Implementing custom 'character class expression' for \p{Xx}.
- Single pass regex preparation.
4 participants